ABSTRACT

In the domain of online advertising, our aim is to serve the
best ad to a user who visits a certain webpage, to maximize
the chance that this user performs a desired action after
seeing the ad. While it is possible to generate a different
prediction model for each user to tell if he/she will act
on a given ad, the prediction result will typically be quite
unreliable, with huge variance, since the desired actions are
extremely sparse, and the set of users is huge (hundreds of
millions) and extremely volatile, i.e., many new users are
introduced every day, and many existing ones are no longer
valid. In this paper we aim to improve the accuracy in finding
users who will perform the desired action, by assigning each
user to a cluster, where the number of clusters (in the order
of hundreds) is much smaller than the number of users. Two
users fall into the same cluster if their event histories are
similar. For this purpose, we modify the probabilistic latent
semantic analysis (pLSA) model by assuming the independence
of the user and the cluster id, given the history of events.
This assumption helps us identify the cluster of a new user
without re-clustering all the users. We present the details
of the algorithm we employed as well as the distributed
implementation on Hadoop, and some initial results on the
clusters that were generated by the algorithm.
1. INTRODUCTION

In online advertising, the goal is to find the best ad under
constraints for a given user in an online context. The context
varies based on the applicable ad format (e.g., banner or
display ad, video ad) and the device (e.g., desktop, mobile)
the user is using at the time of the ad impression (showing
the ad to the user). The constraints are mainly imposed by
advertisers on the target user and context parameters, e.g.,
users in a certain age range visiting webpages of certain
categories. The expectation of an advertiser is one of: brand
recognition, clicks, or actions. Actions are advertiser-defined
and can be one of inquiring about or purchasing a product,
filling out a form, visiting a certain page, etc. [8].

A certain ad from an advertiser can be shown to a user
in an online context only if the value for the ad impression
opportunity is high enough to win in a real-time, competitive
auction [2]. Advertisers signal their value via bids, either
directly or indirectly through demand-side platforms, i.e.
entities that work on behalf of advertisers to deal with
real-time bidding ad exchanges. The bid for an ad impression
is calculated as the action probability given a user in a
certain online context multiplied by the cost-per-action goal
an advertiser wants to meet or beat.
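
As a minimal illustration of the bid rule above, the sketch below
multiplies an action probability by a cost-per-action goal. The
function name and the numbers are hypothetical and are not taken
from our system.

```python
def compute_bid(action_probability: float, cpa_goal: float) -> float:
    """Bid for one impression = p(action | user, context) * cost-per-action goal."""
    if not 0.0 <= action_probability <= 1.0:
        raise ValueError("action_probability must lie in [0, 1]")
    if cpa_goal < 0:
        raise ValueError("cpa_goal must be non-negative")
    return action_probability * cpa_goal


# Hypothetical numbers: a 0.02% action probability and a CPA goal of 50.
bid = compute_bid(0.0002, 50.0)
```

Because actions are so rare, the bid is driven almost entirely by
how well the action probability is estimated, which is what the
rest of the paper addresses.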

The action probability computation is about computing
the likelihood of an action by a certain user on a certain
ad on a certain webpage. Since it is impossible for every
user to see every ad on every webpage, the likelihood is
almost always unknown. As a result, we need to use
campaign/website/user hierarchies and other techniques to
reduce this extreme sparsity and estimate the probability of
this extremely rare event [8]. The idea can be explained best
by an example: the likelihood of a user visiting a certain
webpage can be unknown or zero, but we may have better
confidence if we consider the likelihood of the same user
visiting the top level domain that the webpage belongs to
(e.g. finance page of a news website vs. any page under the
same news website). The example that applies to this paper is
along the user dimension: the likelihood of a user acting on a
certain ad on a certain webpage can be unknown, but we may
approximate the likelihood if we consider the actions of
similar users on the same or similar webpages or the top level
domains they belong to.

User similarity can be deduced from user clustering. User
clustering in turn can be done utilizing user attributes such
as age, gender, income, geographic attributes; however, most
of the user attributes come from third-party data providers,
who provide the data to advertisers at a cost (and only for
a subset of users). As a result, we have to resort to using
user attributes that do not come from these data providers.

In this paper, we propose to use the event history of users
to cluster users into multiple classes and use the class id as
a feature combined with other features, such as advertiser
and publisher properties, to calculate the action probability
at the time of ad serving. This clustering process, named
user segmentation in the advertising domain, is receiving
increasing attention. We present current efforts in the
literature in §2. Please see Figure 1 for a description of the
problem we are attempting to solve. The event history is built
by the event tracking process. When a user performs the desired
action on the advertiser conversion page, a small piece of
JavaScript code, called a beacon or tracking pixel, fires an
event to the ad serving system to report that a desired action
has been fulfilled. Since we use the beacon id to identify
each event in our implementation, we will refer to events as
beacons for the rest of the paper.



Figure 1: Problem Description 

The methodology we propose in this paper is highly inspired
by probabilistic latent semantic analysis (pLSA),
which has been used in document clustering, scene
classification [3], and user clustering for news
recommendation [3]. However, there is a significant difference
between pLSA and our approach. While in pLSA the user and the
item (e.g., recommended item for recommender systems, or
generated words for document clustering) are independent of
each other given the cluster information, in our case, the
cluster id and the user id are independent of each other given
a previous event of a user. This difference makes it possible
to determine the cluster which a new user belongs to at
run-time just by looking at its beacon history. In contrast,
pLSA requires a huge amount of memory to store the cluster id
for each existing user, and has to recompute the topic model
from scratch for new users. Although the modification proposed
for pLSA is quite general and can be applied to multiple
domains, we believe it is especially useful for the online
advertising domain (due to its dynamic nature, i.e. a
constantly evolving set of users). This modification also
significantly reduces the computation of the expectation
maximization (EM) algorithm that is used to learn the
clustering parameters.

The rest of the paper is as follows. We first give the
relevant previous work on user clustering, as well as its
different applications, in §2. Then, we introduce our
methodology, as well as the differences compared to pLSA
and our motivation for these differences, in §3. We give the
implementation details in §4. The initial results of our
clustering methodology are provided in §5. Finally, we conclude
the paper and propose future work in §6.


2. PREVIOUS WORK 

User clustering (or user segmentation) for better
performance in advertising systems is a subject that has gained
increasing attention in the past years. An analysis of how
user segmentation can help click-through rate (CTR), i.e.
the percentage of the advertisements that are clicked by a
user, has been presented in [?], which justifies this interest.

One of the most common methodologies for grouping
similarly behaving users in the online advertising domain is
the utilization of topic models. [16] introduces Probabilistic
Latent Semantic User Segmentation (PLSUS) for advertising,
where each user is represented as a bag of words according
to his/her search queries. Their methodology applies pLSA,
hence suffers from the same problems mentioned above (new
users and cost of computation). Latent Dirichlet Allocation
(LDA) is another type of topic model that is commonly used
to cluster users for many applications. [5] gives an
example where web users are clustered, assuming the users
are the documents and the visited websites are the words.
Furthermore, [?] gives an application of LDA for online
advertising (but on a small, sampled set of users). The main
problems with LDA are the computational complexity of
training, and the difficulty of assigning a new user to a
cluster (topic). While works such as [12] present distributed
implementations of Latent Dirichlet Allocation and the
Hierarchical Dirichlet Process as fast ways to train topic
models, the implementation overhead (cost of human resources
for implementation, as well as scalability issues that affect
applicability in a real advertising scenario) is not
negligible.

Other than directly segmenting the users, there has also
been work on estimating clicks or actions via the user
properties. The first case we would like to list, similar to
our work, is given in [1], where the authors utilize user
actions to classify users into two groups for each advertiser:
converters (will purchase, or click etc.) and non-converters.
They utilize support vector machines (SVM), and train this
model using the past actions of users (similar to the beacons
we mention in this paper). The problem with this approach
is that it effectively classifies users into two groups, and
such a grouping may not be valid for all campaigns (i.e. a
converter for an ad for a certain product may not be a
converter for another product). Our clustering approach aims
at separating users into multiple groups, and calculating
action/click probabilities for each group, over many campaigns.

Another approach to utilizing user information for
advertising is given in [10], where the authors employ
friendship data (due to similarity of friends, i.e. homophily)
as well as other information that arises in a large online
social networking site. This kind of information however is
often proprietary, and not available to most online
advertisers. The paper in [7] aims at incorporating user
demographic information into a text representation. This
representation is later utilized for matching relevant ads to
both ad and publisher's textual properties to enhance
advertising efficiency. While the authors show a CTR lift due
to their methodology, it is not certain whether such matching
is feasible or useful in online advertising systems where
there is only limited demographic information for users as
well as publishers (textual features such as title, keywords
etc. are not always available, but are crucial for the
methodology presented in the paper).

Finally, in the mobile advertising domain, [?] suggests the
usage of mobile user patterns, and therefore segmenting users
based on the pattern similarity. The authors utilize
categorization of location and activities as well as a
constraint-based Bayesian Matrix Factorization model to deal
with sparsity. The main problems with this approach are due to
privacy and scalability, i.e. the lack of detailed activity
information available to advertising systems, and the number
of users in the advertising domain, which make it extremely
hard for such methods to be applied.
3. METHODOLOGY

As mentioned above, our clustering algorithm is highly
inspired by probabilistic latent semantic analysis (pLSA),
with a difference in the independence assumption of user
and cluster, given the history of the user. In this section,
we first give an introduction to pLSA, and then describe how
our model differs and our motivation for choosing
that path. We conclude this section with the algorithmic
and mathematical details of the model we utilized.


3.1 Probabilistic Latent Semantic Analysis 

Probabilistic latent semantic analysis [6] assumes the
existence of a latent class variable (i.e. clusters, or
topics), and that each document has a distribution over the
set of topics. Furthermore, it assumes that given the topic
information, the probability of a word being generated by a
document is independent of that document's id:

$$p(\mathrm{word}_k \mid \mathrm{document}_j, \mathrm{topic}_i) = p(\mathrm{word}_k \mid \mathrm{topic}_i). \qquad (1)$$

This makes pLSA a powerful generative model. That
is, if we want to generate a new word from a document d,
we need to first generate the topic, or cluster c (according
to p(c|d)), and then pick a word w (according to p(w|c)).
This maps to a recommendation system by first selecting
the cluster for a user (document) and then generating
the item (word, i.e. recommendation) from this cluster.
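
The two-stage generative process described above can be sketched
as follows. This is only an illustration: the function name, the
dictionary layout, and the toy distributions are our assumptions
and are not part of the pLSA formulation in [6].

```python
import random


def sample_word(doc, p_cluster_given_doc, p_word_given_cluster, rng):
    """Draw a cluster c ~ p(c|d), then a word w ~ p(w|c)."""
    clusters = list(p_cluster_given_doc[doc])
    cluster_weights = [p_cluster_given_doc[doc][c] for c in clusters]
    cluster = rng.choices(clusters, weights=cluster_weights, k=1)[0]

    words = list(p_word_given_cluster[cluster])
    word_weights = [p_word_given_cluster[cluster][w] for w in words]
    word = rng.choices(words, weights=word_weights, k=1)[0]
    return cluster, word


# Toy (hypothetical) distributions for one document and two clusters.
p_c_d = {"d1": {"sports": 0.7, "finance": 0.3}}
p_w_c = {"sports": {"goal": 0.6, "match": 0.4},
         "finance": {"stock": 0.5, "bond": 0.5}}
cluster, word = sample_word("d1", p_c_d, p_w_c, random.Random(0))
```

Note that the word depends on the document only through the
sampled cluster, which is exactly the independence stated in Eq. 1.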

The parameters of the pLSA model are learned using the
expectation maximization algorithm [6] as follows:


• Expectation Step: Update the probability of a cluster
given the document and word:

$$p(c_i \mid d_j, w_k) = \frac{p(c_i, d_j, w_k)}{\sum_{c_m} p(c_m, d_j, w_k)} = \frac{p(w_k \mid c_i)\, p(c_i \mid d_j)\, p(d_j)}{\sum_{c_m} p(w_k \mid c_m)\, p(c_m \mid d_j)\, p(d_j)} = \frac{p(w_k \mid c_i)\, p(c_i \mid d_j)}{\sum_{c_m} p(w_k \mid c_m)\, p(c_m \mid d_j)}$$


• Maximization Step: Update the conditional probabilities
of words given clusters, and clusters given documents:

$$p(w_k \mid c_i) = \frac{\sum_{d_j} n(d_j, w_k)\, p(c_i \mid d_j, w_k)}{\sum_{w_m} \sum_{d_j} n(d_j, w_m)\, p(c_i \mid d_j, w_m)}$$

$$p(c_i \mid d_j) = \frac{\sum_{w_k} p(c_i, d_j, w_k)}{\sum_{c_m} \sum_{w_k} p(c_m, d_j, w_k)} = \frac{\sum_{w_k} p(d_j, w_k)\, p(c_i \mid d_j, w_k)}{p(d_j)} = \frac{\sum_{w_k} \alpha\, p(d_j, w_k)\, p(c_i \mid d_j, w_k)}{\alpha\, p(d_j)} = \frac{\sum_{w_k} n(d_j, w_k)\, p(c_i \mid d_j, w_k)}{n(d_j)}$$

where $n(d_j)$ is the number of times we have seen a
word from document $d_j$ in the training data. This
number arises when $\alpha$ is taken to be the total
number of word-document co-occurrences (i.e.
$p(d_j) = n(d_j)/\alpha$).

As mentioned above, the pLSA model is a very appropriate
model for recommendation systems, since it is capable of
suggesting a new item (i.e. one that the user has not seen
before) to a user, due to the assumption that given the
cluster id, the user and the item sets are independent of each
other. But it comes with a cost: first, the necessity to
store the cluster information for each user, and second, the
problem of determining the cluster for new users. In most
recommendation systems, the user sets are quite stable (i.e.
not too many users come and go), hence it is usually unlikely
that we will have to determine a user's cluster after the
training period is over. Even if the item lists change
frequently, it is a much easier task to update the item
probabilities for each cluster (a single maximization step),
as long as the user sets do not change too frequently; if they
do, we need to rerun the whole EM procedure.

We will next give the difficulties that arise in the domain
of online advertising, hence the need for a different approach.
3.2 Challenges of Online Advertising

As we have explained in the previous subsection, pLSA
works well in recommending new items, given that the users are
stable and we know the cluster id for these users. In online
advertising however, we try to serve all the users over
all the web (hence no subscription system), and the user ids
are much more volatile, since a web user's id is based on
the cookies on his/her device. If these cookies are deleted,
the user's cluster id is no longer valid, since the user id no
longer exists either. Also note that this causes many new
users to arrive in (or leave) the ecosystem every day. What we
need is a system that looks at the history of the user's events
and assigns the user to a cluster at ad-serving time (i.e.
run-time).














Another problem in the online advertising domain is the fact
that we are not trying to recommend new items to users, at
least not in the context of this paper. What we are trying to
do is to understand the action/click probability of
a user, given that the user belongs to a certain cluster (i.e.
if a user belongs to a cluster due to his/her events, we want
the cluster id, not a prediction of which other events
(beacons) the user will perform given the past ones). Hence we
want a model which only looks at the event history of a user
(not the user id, since it is too volatile) and gives the group
(cluster) id (this also covers all the new users, as long as
we know their beacon history), so that we can calculate the
action probability of that group (cluster). This also gives
us the biggest difference of our methodology compared to
pLSA: given the beacon history, the user is independent of
the cluster id. This difference, as well as others that arise
from it, are given in the next subsection.

3.3 Differences of Our Methodology 
Compared to pLSA 

The biggest difference of the clustering method we employ
compared to pLSA is in the model. Keeping with the terms
of online advertising, while pLSA assumes that the beacons are
independent of the user id as long as the cluster id is
known, we assert that the cluster id is independent of the
user id, given the beacon id. This independence assumption in
the model helps us determine the cluster of a new or
existing user at run-time from its beacon history; therefore,
we do not need to store any cluster id for any user. Since
we can now determine the cluster of a user at run-time,
an ad-serving system calculates the action or click
probability of this user given a specific ad on a specific
website (publisher) as "What is the action/click probability
of a user belonging to cluster i for this ad at this website?".

Please see the difference between the models in plate notation
in Figure 2. In the figure, it can be seen that the user
node could be directly removed from our model, and
hence its relationship with pLSA becomes quite weak (for
good reason, due to the nature of online advertising as
we listed in the previous section). Furthermore, our model is
quite general and can be applied to different application
domains; however, it is especially necessary and useful in
online advertising due to the constantly evolving set of users.

Our proposed formulation naturally also brings some
differences in the EM algorithm that needs to be run for
training model probabilities. As an obvious example of this, we
mentioned previously that the expectation step of pLSA
calculates $p(c_i \mid d_j, w_k)$, which is the conditional
probability of cluster i given document j and word k, and which
can be interpreted in the online advertising domain as the
conditional probability of cluster i given user j ($u_j$) and
beacon k ($b_k$). But due to our model's independence
assumption, $p(c_i \mid u_j, b_k) = p(c_i \mid b_k)$. The
details of the EM algorithm we employ, as well as the
determination of the cluster for a user at run-time, are given
in the next subsection.

3.4 Training the Model and Run-time Cluster 
Determination for Users 

To be able to determine the cluster of a user at run-time, 
we need to have access to the beacon history of this user. 
The beacon history of a user is the set of beacons (events) 
that have been performed by this user in the previous x days 
(we took this value to be 60 days for our implementation), 




(b) Plate notation for the clustering 
method we utilized 


Figure 2: Difference between pLSA and the clustering
we utilized. Please notice that while pLSA assumes that the
beacons and users are independent given the cluster id, we
assume that the users and clusters are independent given a
beacon id from the user's history of events.


e.g. $h(u_i) = \{b_{i,1}, b_{i,2}, \ldots, b_{i,n}\}$ for user i, if it has performed
n actions/beacons. Not all of these n beacons have to be
unique, hence we can come up with a probability distribution
such as $p(b_j \mid u_i) = \frac{n(b_j, u_i)}{n(u_i)}$, where $n(b_j, u_i)$ is the number of
times user i has seen beacon j, and $n(u_i)$ is the total number
of beacons that user i has seen. Therefore, at run-time, we
calculate:

$$p(c_i \mid u_j) = \sum_{b_k} p(c_i, b_k \mid u_j) = \sum_{b_k} p(c_i \mid b_k)\, p(b_k \mid u_j) \qquad (2)$$

for each cluster i, and determine the cluster that gives the
maximum conditional probability to be the cluster that user
j belongs to. Since the beacon probabilities given users are
known at run-time, all we have to store is the beacon to
cluster mapping (i.e. $p(c_i \mid b_k)$ for all beacon cluster pairs).
Please note that this requires much less storage than keeping
cluster ids for each user (~500 million users, while we take
~600 clusters, due to our previous experience with different
numbers of clusters, and around 13,000 beacons, and not
every beacon has a non-zero conditional probability for each
cluster). It is also robust to changing beacon history for each
user, as well as to new users (whose beacons might have been
fired elsewhere, or who were not present when model
training was being performed).
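
The run-time cluster determination of Eq. 2 can be sketched as
below. This is a minimal illustration and not our production code:
the function names, the dictionary layout of the stored beacon to
cluster mapping, and the toy beacon and cluster names are all
hypothetical.

```python
from collections import Counter


def beacon_distribution(history):
    """p(b | u) = n(b, u) / n(u), computed from a user's beacon history."""
    counts = Counter(history)
    total = len(history)
    return {beacon: n / total for beacon, n in counts.items()}


def assign_cluster(history, p_cluster_given_beacon):
    """Eq. 2: p(c | u) = sum_b p(c | b) * p(b | u); return the argmax cluster."""
    scores = {}
    for beacon, p_b in beacon_distribution(history).items():
        # Beacons missing from the stored mapping contribute nothing.
        for cluster, p_c in p_cluster_given_beacon.get(beacon, {}).items():
            scores[cluster] = scores.get(cluster, 0.0) + p_c * p_b
    if not scores:
        return None, scores
    return max(scores, key=scores.get), scores


# Hypothetical stored mapping p(c | b) and a three-event history.
mapping = {
    "hotel_booking": {"travel": 0.9, "autos": 0.1},
    "car_rental_home": {"travel": 0.4, "autos": 0.6},
}
best, scores = assign_cluster(
    ["hotel_booking", "hotel_booking", "car_rental_home"], mapping)
```

Only the mapping needs to be kept in memory; the user's own
history supplies everything else, so no per-user cluster id is
stored.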

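The EM algorithm is trying to maximize

$$\prod_{\forall u_j} \; \prod_{\forall c_i \text{ where } p(c_i \mid u_j) > 0} p(c_i \mid u_j) \;=\; \prod_{\forall u_j} \; \prod_{\forall c_i \text{ where } p(c_i \mid u_j) > 0} \; \sum_{b_k} p(c_i \mid b_k)\, p(b_k \mid u_j),$$

per Eq. 2. This process is directed by the above cluster
determination policy, and is as follows: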

• Expectation Step: For each user in the training
data, calculate the probability of this user belonging
to a cluster i:

$$p(c_i \mid u_j) = \sum_{b_k} p(c_i \mid b_k)\, p(b_k \mid u_j),$$

which can further be written as:

$$p(c_i \mid u_j) = \sum_{b_k} \frac{p(b_k \mid c_i)\, p(c_i)}{p(b_k)}\, p(b_k \mid u_j),$$

where $p(b_k)$ is the probability of beacon k, and is
calculated as $p(b_k) = \frac{\sum_{u_n} n(u_n, b_k)}{\sum_{u_n} \sum_{b_m} n(u_n, b_m)}$, where $n(u_n, b_k)$
is the number of times the beacon k has been seen by
user n in the training data.


• Maximization Step: In the maximization step, we
calculate the values $p(c_i)$ and $p(b_k \mid c_i)$:

$$p(c_i) = \sum_{u_j} p(u_j)\, p(c_i \mid u_j),$$

where $p(c_i \mid u_j)$ is the output of the expectation step.
Furthermore, the marginal probability for the user $u_i$
is $p(u_i) = \frac{n(u_i)}{\sum_{\forall u_j} n(u_j)}$, where $n(u_i)$ is the number of
beacons $u_i$ has fired. We also calculate:

$$p(b_k \mid c_i) = \sum_{u_j} p(b_k \mid u_j)\, p(u_j \mid c_i) = \sum_{u_j} p(b_k \mid u_j)\, \frac{p(c_i \mid u_j)\, p(u_j)}{p(c_i)},$$

where $p(c_i \mid u_j)$ is the output of the expectation step,
and $p(c_i)$ is calculated as part of the maximization
step.


The above formulation means that in the expectation step,
we assign users to clusters (according to the beacon to cluster
mapping or, since we do not have this mapping at the beginning,
randomly); and in the maximization step we calculate the beacon
to cluster mapping parameters, as well as the cluster
probabilities. This goes on until convergence (i.e. the cluster
probabilities as well as the beacon to cluster probabilities no
longer change significantly). This scheme is detailed in
Algorithm 1. One detail in the algorithm is the difference
between hard clustering and soft clustering. In hard
clustering, we assign the user to the most likely cluster
$c_i$, making $p(c_i \mid u) = 1$ (similar to run-time cluster
determination); while in soft clustering we assign the user to
multiple clusters, where $\sum_{c_i} p(c_i \mid u) = 1$.

At the end of the EM process, we output $p(c_i \mid b_k)$ for each
beacon cluster pair, to be used in run-time user to cluster
determination.
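
The training loop described in this subsection can be sketched on
a single machine as follows. This is an illustrative sketch under
our own assumptions, not the distributed implementation: the
function names, the `max_cycles`, `tol` and `seed` parameters, and
the toy beacon names are hypothetical, every user is assumed to
have at least one beacon, and `max_cycles` is assumed to be at
least 1.

```python
import random


def _max_change(old_p_c, new_p_c, old_p_b_c, new_p_b_c):
    """Largest absolute change in p(c_i) or p(b_k | c_i) between two cycles."""
    change = max(abs(a - b) for a, b in zip(old_p_c, new_p_c))
    for old, new in zip(old_p_b_c, new_p_b_c):
        for beacon in set(old) | set(new):
            change = max(change, abs(old.get(beacon, 0.0) - new.get(beacon, 0.0)))
    return change


def learn_cluster_parameters(user_counts, num_clusters, hard=True,
                             max_cycles=50, tol=1e-6, seed=0):
    """user_counts: one non-empty {beacon_id: count} dict per user.
    Returns p(c_i | b_k) as {beacon_id: [probability for each cluster]}."""
    rng = random.Random(seed)
    total = sum(sum(counts.values()) for counts in user_counts)
    p_u = [sum(counts.values()) / total for counts in user_counts]   # p(u_j)
    p_b_u = [{b: n / sum(counts.values()) for b, n in counts.items()}
             for counts in user_counts]                               # p(b_k | u_j)
    p_b = {}                                                          # p(b_k)
    for counts in user_counts:
        for b, n in counts.items():
            p_b[b] = p_b.get(b, 0.0) + n / total

    # Initialization: assign every user to one random cluster.
    p_c_u = []
    for _ in user_counts:
        row = [0.0] * num_clusters
        row[rng.randrange(num_clusters)] = 1.0
        p_c_u.append(row)

    p_c, p_b_c = None, None
    for _ in range(max_cycles):
        # Maximization step: p(c_i) and p(b_k | c_i).
        new_p_c = [sum(p_u[j] * p_c_u[j][i] for j in range(len(p_u)))
                   for i in range(num_clusters)]
        new_p_b_c = [{} for _ in range(num_clusters)]
        for j, dist in enumerate(p_b_u):
            for i in range(num_clusters):
                weight = p_c_u[j][i] * p_u[j]
                if weight > 0.0:  # implies new_p_c[i] > 0, so the division is safe
                    for b, p in dist.items():
                        new_p_b_c[i][b] = (new_p_b_c[i].get(b, 0.0)
                                           + p * weight / new_p_c[i])
        converged = (p_c is not None
                     and _max_change(p_c, new_p_c, p_b_c, new_p_b_c) < tol)
        p_c, p_b_c = new_p_c, new_p_b_c
        if converged:
            break
        # Expectation step:
        # p(c_i | u_j) = sum_k p(b_k | c_i) p(c_i) / p(b_k) * p(b_k | u_j).
        for j, dist in enumerate(p_b_u):
            scores = [sum(p_b_c[i].get(b, 0.0) * p_c[i] / p_b[b] * p
                          for b, p in dist.items())
                      for i in range(num_clusters)]
            if hard:
                best = max(range(num_clusters), key=scores.__getitem__)
                scores = [1.0 if i == best else 0.0 for i in range(num_clusters)]
            p_c_u[j] = scores

    # Output p(c_i | b_k) for run-time cluster determination.
    return {b: [p_b_c[i].get(b, 0.0) * p_c[i] / p_b[b] for i in range(num_clusters)]
            for b in p_b}


# Toy (hypothetical) training data: four users, four beacons.
users = [{"hotel": 3, "flight": 1}, {"hotel": 2}, {"car": 4}, {"car": 1, "rental": 1}]
mapping = learn_cluster_parameters(users, num_clusters=2)
```

As in the text, a cluster that receives no users after the random
initialization keeps zero probability, which is one reason the
number of clusters is chosen from experience.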


4. IMPLEMENTATION DETAILS 

We utilized Apache Pig [?] to implement the clustering
algorithm outlined in this paper. Pig is a high-level language
working on top of the Hadoop [15] framework, hence it makes our
clustering task complete much more quickly due to
parallelization. As stated in the previous section, the
clustering task starts with determining the cluster id for each
line (i.e. data for a single user) in the user information
folder, which contains $p(u_j)$ as well as the set of unique
beacons this user has encountered, with their conditional
probabilities ($p(b_k \mid u_j)$). This user data is
pre-calculated to compress the whole training data, as well as
to reduce the time for a single EM step. The Pig script section
for this cluster determination step is given in Figure 3(a).
In the figure, userWeight is $p(u_j)$, user is a bag of
$p(b_k \mid u_j)$, and CLUSTERID is a user defined function

Algorithm 1 EM Algorithm to Learn the Parameters of
the User Clusters

learnClusterParameters
(userSet U, beaconSet B, numClusters):
  for each user u_j ∈ U do
    assign u_j to a random c_i where i ∈ [1, numClusters]
    p(c_i|u_j) = 1
  end for
  for each cluster c_i do
    p_old(c_i) = null
    calculate p_new(c_i)
    for each b_k do
      p_old(b_k|c_i) = null
      calculate p_new(b_k|c_i)
    end for
  end for
  while
      {p_old(c_i) and p_new(c_i) are significantly
       different for any c_i}
      OR
      {p_old(b_k|c_i) and p_new(b_k|c_i) are significantly
       different for any c_i and b_k} do
    for each user u_j do
      if Soft Clustering then
        calculate p(c_i|u_j) for all c_i
      else
        choose c_i for u_j where p(c_i|u_j) is maximum
        p(c_i|u_j) = 1
      end if
    end for
    for each cluster c_i do
      p_old(c_i) = p_new(c_i)
      calculate p_new(c_i)
      for each b_k do
        p_old(b_k|c_i) = p_new(b_k|c_i)
        calculate p_new(b_k|c_i)
      end for
    end for
  end while
  for each cluster c_i do
    for each beacon b_k do
      calculate and return p(c_i|b_k)
    end for
  end for


(udf) which has the inputs of $p(c_i)$ and $p(b_k \mid c_i)$ as well as
the user (bag of $p(b_k \mid u_j)$) and produces a bag of $p(c_i \mid u_j)$
(i.e. clusterIdsWeights). This structure means that we can
calculate the EM step parameters by flattening (unfolding)
either the user bag, or clusterIdsWeights.

Another problem is the calculation of normalized
probabilities, where at the beginning we have a weight for each
probability, but we need the normalized values for the same
group. For example, suppose we start with a variable
clusterBeaconWeight which has, for each $(c_i, b_k)$ pair, the
value $\alpha\, p(b_k \mid c_i)$ (where $\alpha$ is not known). To
calculate the exact value $p(b_k \mid c_i)$, we need to first group
according to cluster id, then sum up $\alpha\, p(b_k \mid c_i)$ as the
normalizing factor. Later, we unfold the group with this sum
appended, and then divide by the normalizer to find the exact
probability. The example code piece that achieves this flow is
given in Figure 3(b), where the number after PARALLEL












userClusterIds = FOREACH userData GENERATE
    userWeight, user,
    CLUSTERID(user) AS clusterIdsWeights;

(a) Code for User-Cluster Determination

a = GROUP clusterBeaconWeight BY clusterId
    PARALLEL 400;

-- we sum up the alpha * p(b|c)
-- for each c to get normalized p(b|c)
b = FOREACH a GENERATE
    FLATTEN($1) AS (clusterId, beaconId, weight),
    SUM($1.weight) AS clusterTotalWeight;

c = FOREACH b GENERATE clusterId, beaconId,
    weight/clusterTotalWeight AS normalizedWeight;

(b) Code for Probability Normalization

Figure 3: Apache Pig code examples from our user 
clustering implementation, which should be helpful 
in understanding the parallel computation steps. 


gives the number of reducers, and the keyword FLATTEN
is used to unfold the bag which is the set of $p(b_k \mid c_i)$ for the
same cluster.
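
For readers less familiar with Pig, the same group, sum, and
divide flow can be sketched on a single machine as below. The
function name and the toy rows are hypothetical; the sketch only
mirrors the logic of the normalization step and says nothing about
how the work is split across reducers.

```python
from collections import defaultdict


def normalize_by_cluster(cluster_beacon_weight):
    """Turn (cluster_id, beacon_id, alpha * p(b|c)) rows into p(b|c) rows.

    The input must be a list (it is traversed twice)."""
    # GROUP ... BY cluster id, then SUM the weights of each group.
    totals = defaultdict(float)
    for cluster_id, _, weight in cluster_beacon_weight:
        totals[cluster_id] += weight
    # FLATTEN the groups again and divide by the per-cluster total.
    return [(cluster_id, beacon_id, weight / totals[cluster_id])
            for cluster_id, beacon_id, weight in cluster_beacon_weight
            if totals[cluster_id] > 0]


# Hypothetical unnormalized weights for two clusters.
rows = [(1, "a", 2.0), (1, "b", 6.0), (2, "a", 5.0)]
normalized = normalize_by_cluster(rows)
```

The unknown factor $\alpha$ cancels in the division, which is why
it never needs to be computed explicitly.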

The power of using Pig, and hence the Hadoop framework, comes
from assigning the cluster ids to many users on different
machines, as well as from the sum operations working on
different cluster parameters at the same time. Please note that
the algorithm (Algorithm 1) we provided for model training
is embarrassingly parallel, i.e. at each point of training we
can divide the dataset completely according to the current
formulation we want to compute (whether by user id, beacon
id, or cluster id). This also means that our parallel
implementation is linearly scalable, i.e. it is n-times faster
than a single machine approach at any step of training*, where
n is the number of machines (e.g. number of reducers for
reducer-side operations such as grouping and counting, or
number of mappers for mapper-side operations, such as
calculating $p(c_i \mid u_j)$ in Algorithm 1). It all adds up to being
able to learn the model feasibly for the huge number of users
and events we deal with in the online advertising domain. In
our model, we utilized a set of ~13,000 beacons that have been
captured over 3 months. This set has been selected by first
removing both tails from the whole set of beacons, i.e.
removing the most frequent (because they are not
discriminative enough) and the least frequent (because they are
too uncommon), and then sampling among the remaining beacons.
This set of beacons translates into ~500 million
users. We have observed that the EM algorithm converges in
about 25-30 cycles (for hard clustering). Each cycle takes
about 15 minutes in our test cluster (for a setting of ~3500
mappers, where the processing takes ~1.5 minutes per mapper;
and ~400 reducers, where the processing takes ~4 minutes per
reducer), hence a total of about 7 hours for
the whole process.
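
The beacon selection described above (drop both tails of the
frequency distribution, then sample from the rest) can be sketched
as follows. The function name, the cut-off parameters, and the toy
counts are hypothetical; the actual thresholds we used are not
given here.

```python
import random


def select_beacons(beacon_counts, drop_rarest, drop_most_frequent,
                   sample_size, seed=0):
    """Remove both frequency tails, then sample from the remaining beacons."""
    ranked = sorted(beacon_counts, key=beacon_counts.get)  # rarest first
    middle = ranked[drop_rarest:len(ranked) - drop_most_frequent]
    rng = random.Random(seed)
    return rng.sample(middle, min(sample_size, len(middle)))


# Hypothetical counts: beacon "b0" fired once, ..., "b9" fired ten times.
counts = {f"b{i}": i + 1 for i in range(10)}
selected = select_beacons(counts, drop_rarest=2, drop_most_frequent=2,
                          sample_size=3)
```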


* In Algorithm 1, calculation of $p(c_i \mid u_j)$ is one map-reduce
job (mapper-intensive), and calculation of both $p_{new}(c_i)$ and
$p_{new}(b_k \mid c_i)$ is another separate map-reduce job
(reducer-intensive). These two run sequentially and repeatedly
until convergence.


5. RESULTS 

In this section we will present our evaluation of the
proposed clustering algorithm. We will provide both
interpretability results (i.e. what the constructed clusters
look like) and numerical results (i.e. how well advertising
with the new clustering algorithm performs) in the rest
of the section. Please note that we are giving a preliminary
overview of the performance of our proposed method,
and the reasons for non-exhaustive comparisons with other
well-known methods are twofold. The first is the
inappropriateness of methods like pLSA and LDA for the dynamic
nature of the users in online advertising, as well as the
scalability issues in LDA (for the more advanced parallel
training methods mentioned in §2, we still have significant
implementation overhead in terms of man-hours). The second
reason is the way we wanted to evaluate the proposed
methodology. In our domain, we deploy new models into our
actual advertising system to check their performance, and any
test means lost or gained revenue in such an industrial
scenario. In most cases, the online test is the only feasible
way to evaluate a new model/method, since this is the only way
we can directly measure the more relevant metrics (such as
revenue, actions/clicks vs. cost of advertising etc.) in online
advertising. This is why some of the more recent work [9] on
offline evaluation is often not useful: the metrics that can be
examined in such settings (such as click-through rate, AUC
etc.) often do not correlate with online metrics. Due to this,
it is often not possible to test many models (unless we are
fairly confident of the performance), since it may mean lost
revenue; this is why we limited the testing of our proposed
model and algorithm to a comparison against our previously
employed algorithm, which is based on the cosine similarity
metric. Therefore, in this section, we are basically
evaluating the outcome of our model replacement efforts.

Let us start with our interpretability results, which are a 
way to manually check whether the clusters generated by our 
method make sense. Figure 4 gives a number of example 
clusters constructed by the proposed algorithm. Although we 
set our system to generate around 600 clusters (from our 
previous experience with the domain, we found this to be a 
good number of partitions for the whole set of users), we 
present only eight of those clusters here. In the figure, we 
show only the top three beacons for each cluster, and the 
number in parentheses after each beacon definition gives the 
probability of that beacon being generated by a user 
belonging to that cluster (e.g. 
p(Hotel 2 - Conversion Action | Cluster 1) = 0.12). For 
privacy reasons, we do not give the exact names of the 
beacons we used; instead, we give a categorization of the 
subject of the beacon and the type of event the beacon 
indicates. For example, Homepage Action means a visit to the 
home page of an advertiser by a user, whereas a Conversion 
Action usually indicates a purchase. We furthermore give, in 
parentheses next to each cluster, the number of users that 
fell under that cluster during the clustering process (e.g. 
661,083 users out of the ~500M users we processed fell under 
Cluster 5). 
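
As an illustration of how such a per-cluster summary can be 
read off a fitted model, the following is a minimal sketch 
that ranks beacons by p(beacon|cluster) and keeps the top 
three. The function name, the dictionary layout, and the 
fourth beacon in the toy input are assumptions made for this 
example; they are not taken from our production system. 

```python
from typing import Dict, List, Tuple


def top_beacons(
    p_beacon_given_cluster: Dict[str, Dict[str, float]], k: int = 3
) -> Dict[str, List[Tuple[str, float]]]:
    """For each cluster, return the k beacons with the highest p(beacon | cluster)."""
    summary = {}
    for cluster, beacon_probs in p_beacon_given_cluster.items():
        # Sort beacons by probability, highest first, and keep the first k.
        ranked = sorted(beacon_probs.items(), key=lambda kv: kv[1], reverse=True)
        summary[cluster] = ranked[:k]
    return summary


# Toy input: the first three entries mirror Cluster 1 in the figure;
# "Hypothetical Beacon X" and its probability are made up for illustration.
example = {
    "Cluster 1": {
        "Hotel 2 - Conversion Action": 0.12,
        "Hypothetical Beacon X": 0.02,
        "Hotel 1 - Reservation Action": 0.35,
        "Hotel 1 - Directory Search Action": 0.13,
    }
}

for cluster, beacons in top_beacons(example).items():
    print(cluster)
    for name, prob in beacons:
        print(f"  {name} ({prob:.2f})")
```

Only the ranking step is shown; estimating the probabilities 
themselves is the job of the clustering algorithm. 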

The results, a subset of which is given in the figure, show 
that the clusters are mostly meaningful, i.e. the events 
(beacons) that fall under the same cluster are similar. 
Furthermore, it can be seen that they do not always belong 
to the same advertiser, but usually to the same subject. For 
example, Cluster 7 has events/beacons from different 
advertisers under it, one of them being a political 
campaign, one a political group, and one a political 
magazine. The clustering algorithm was able to gather these 
events under the same cluster since they all belong to the 
subject Politics (this is the expected behavior, since the 
users that fall under this cluster are the ones that are 
interested in politics). Similar arguments can easily be 
made for the other clusters presented in the figure, and for 
most of the clusters that are currently being used in our 
system but are not presented here. 

Cluster 1 (486,299 users) 
  Hotel 1 - Reservation Action (0.35) 
  Hotel 1 - Directory Search Action (0.13) 
  Hotel 2 - Conversion Action (0.12) 

Cluster 2 (805,853 users) 
  Touristic Location 1 - Remarketing Action 1 (0.36) 
  Touristic Location 1 - Remarketing Action 2 (0.35) 
  Touristic Location 2 - Homepage Action (0.10) 

Cluster 3 (871,820 users) 
  Car Rental Comp. 1 - Retargeting Action (0.44) 
  Car Rental Comp. 2 - Homepage Action (0.23) 
  Car Rental Comp. 3 - Homepage Action (0.08) 

Cluster 4 (353,682 users) 
  Trousers Comp. 1 - Women's Homepage Action (0.19) 
  Trousers Comp. 1 - Homepage Action (0.18) 
  Trousers Comp. 2 - Homepage Action (0.16) 

Cluster 5 (661,083 users) 
  Women's Clothing Comp. 1 - Catalog Action (0.21) 
  Women's Clothing Comp. 2 - Site Interaction (0.20) 
  Clothing Comp. - Site Interaction (0.14) 

Cluster 6 (1,261,022 users) 
  Cruise Comp. 1 - Homepage Action (0.39) 
  Cruise Website 1 - Homepage Action (0.18) 
  Cruise Website 1 - Cruise Comp. 2 Action (0.10) 

Cluster 7 (280,334 users) 
  Political Campaign - Remarketing Action (0.36) 
  Political Group - Homepage Action (0.16) 
  Politics Magazine - Homepage Action (0.16) 

Cluster 8 (357,532 users) 
  Japanese Car Comp. - Homepage Action (0.25) 
  German Car Comp. - Homepage Action (0.24) 
  French Car Comp. - Homepage Action (0.16) 

Figure 4: Example clusters for the online advertising domain. Cluster ids are followed by the number of 
users in our system that belong to this cluster. We have listed only the three most significant beacons for 
each cluster, and the numbers in parentheses give the values p(beacon|cluster). 

Next, we give a comparison of the clustering algorithm 
against a previous system that was used in our company’s 
advertising framework, again for user segmentation. Our 
previous algorithm also utilizes user beacons, but employs a 
cosine similarity metric (taking the beacon set as a feature 
vector for each user) to determine which users are closer, 
or more similar, to each other, and hence should be in the 
same cluster for action rate and click rate calculation. In 
this algorithm, each beacon $b_i$ has a weight $w_i$, i.e. 
$w_i = a\,|u(b_i)|$, where $u(b_i)$ is the set of users that 
have fired this beacon at least once. Furthermore, each user 
$u_j$ is represented by a vector in which the entries for 
the beacons that have been seen by this user have the value 
1, and the others 0 (i.e. $v(u_j) = [v_1, \ldots, v_n]$, 
where $n$ is the total number of distinct beacons in the 
system, and $v_i = 1$ if $u_j$ has seen beacon $b_i$, 0 
otherwise). The algorithm starts with $n$ centroids, where 
$n$ is the total number of beacons available (at the 
beginning, each centroid $c_m$ is a vector 
$v(c_m) = [v_1, \ldots, v_n]$ where only $v_m$, representing 
beacon $b_m$, is 1, and all other entries are 0), and 
applies k-means on this set of centroids (we take k = 600, 
similar to the new methodology). Each user $u_j$ is assigned 
to a centroid, hence cluster, as follows: 

$$\mathrm{Cluster}(u_j) = \arg\max_{c_i \in C} \frac{v(u_j) \cdot v(c_i)}{\|v(u_j)\| \, \|v(c_i)\|},$$ 

where the norm and dot product calculations take into 
account the weights ($w_i$) of the beacons ($b_i$). 
Furthermore, the entries of the centroid vectors are updated 
during k-means according to the vectors of the users 
assigned to them. In this algorithm, each k-means is 
followed by a merging stage in which significantly close 
centroids are merged. This process continues until no more 
centroid merges are possible. The disadvantage of this 
algorithm is twofold. The first issue is the already high 
(and increasing) number of users in the advertising domain, 
and hence the computational cost. The second is the 
assumption that any beacon seen by a user has the same 
weight for that user (i.e. we do not utilize p(beacon|user) 
values), which does not take into account the possibility 
that many occurrences of the same event (beacon) may indeed 
indicate higher interest in a specific type of product. 
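
To make the assignment step of this previous algorithm more 
concrete, the following is a minimal sketch of assigning a 
user to the centroid with the highest weighted cosine 
similarity. How exactly the weights enter the dot product 
and the norms is an assumption here (each coordinate's 
product is multiplied by its beacon weight); the function 
names and the toy vectors are also hypothetical. The 
centroid update and merging stages are not shown. 

```python
import math
from typing import Sequence


def weighted_cosine(x: Sequence[float], y: Sequence[float], w: Sequence[float]) -> float:
    """Cosine similarity under a weighted inner product (assumed form):
    <x, y>_w = sum_i w_i * x_i * y_i."""
    dot = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    norm_x = math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)))
    norm_y = math.sqrt(sum(wi * yi * yi for wi, yi in zip(w, y)))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # A vector with no weighted mass is similar to nothing.
    return dot / (norm_x * norm_y)


def assign_cluster(
    user_vec: Sequence[float],
    centroids: Sequence[Sequence[float]],
    w: Sequence[float],
) -> int:
    """Return the index of the centroid most similar to the user vector."""
    return max(
        range(len(centroids)),
        key=lambda i: weighted_cosine(user_vec, centroids[i], w),
    )


# Toy setup with three beacons: centroids start as one-hot vectors,
# one per beacon, as in the initialization described above.
weights = [1.0, 1.0, 2.0]  # hypothetical beacon weights
centroids = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
user = [1, 0, 1]           # this user has seen beacons 1 and 3

print(assign_cluster(user, centroids, weights))
```

In this toy case the user is assigned to the third centroid, 
because the third beacon carries the larger weight. 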

We have run an A/B test using two models, one utilizing the 
new clusters/methodology and one the old 
clusters/methodology. These models have been run on the 
whole set of campaigns within Turn, where impression traffic 
was directed to the two models with equal priority. In both 
models, the users that belong to the same cluster have the 
same predicted action/click rates (which are utilized for 
calculating the bid value) for the same campaign. These 
rates are in turn calculated from historical action/click 
data (separately for each campaign) among the users that 
belong to the same cluster. Therefore, the cluster id is 
used as a single feature, alongside the campaign id. We 
provide results in terms of the effective cost per action 
(eCPA) and effective cost per click (eCPC) metrics, which 
can be described as follows: 

• Effective Cost per Action (eCPA): What is the 
average amount of money that is spent by an advertiser 
(on advertising) to receive one action (i.e. purchase 
etc.)? This metric can be calculated as: 

$$\mathrm{eCPA} = \frac{\text{Advertising Cost}}{\text{\# of Actions}}$$ 

• Effective Cost per Click (eCPC): What is the 
average amount of money that is spent by an advertiser 
(on advertising) to receive one click (on its ad)? This 
metric can be calculated as: 

$$\mathrm{eCPC} = \frac{\text{Advertising Cost}}{\text{\# of Clicks}}$$ 
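
The two metrics above can be computed with a few lines of 
code. The following is a minimal sketch; the function name 
and all of the numbers are hypothetical and are used only to 
show the arithmetic. 

```python
from typing import Optional


def effective_cost(advertising_cost: float, num_events: int) -> Optional[float]:
    """Advertising cost divided by the number of events (actions or clicks).

    Returns None when there are no events, since the metric is undefined then.
    """
    if num_events == 0:
        return None
    return advertising_cost / num_events


# Hypothetical daily totals for one model in an A/B test.
cost = 1200.0   # total advertising spend
actions = 40    # number of attributed actions
clicks = 600    # number of clicks

ecpa = effective_cost(cost, actions)
ecpc = effective_cost(cost, clicks)
print(f"eCPA = {ecpa:.2f}, eCPC = {ecpc:.2f}")
```

With these made-up totals, eCPA is 30.00 and eCPC is 2.00; 
lower values mean that the same spend bought more actions or 
clicks. 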

The above metrics are representative of the quality of 
clustering, since our bidding logic takes the cluster ids of 
users directly into consideration. If we are able to 
separate the users into meaningful segments (via 
clustering), then the action probability calculated for each 
segment is more accurate (and therefore our bid values are 
closer to the actual value of each impression). As a result, 
the money spent on each impression (due to bidding) brings 
more actions/clicks, hence improving the eCPC and eCPA 
metrics. 

The results of our initial experiments, which span a 10-day 
period within July-August 2013, are given in Figures 5 and 
6. While Figure 5 compares the day-by-day eCPA performance 
of both algorithms, Figure 6 presents the same analysis for 
eCPC performance. Due to privacy issues, we do not present 
the exact values in the plots; instead, we have scaled the 
values by a factor. It can be seen that the proposed 
algorithm utilizing the topic model performs much better 
(i.e. lower eCPA or eCPC is better) compared to the 
algorithm which utilizes the cosine similarity metric. 
Furthermore, we see that the eCPA has an increasing trend 
for both models at the end of the experimentation period. 
This is due to the action attribution problem inherent in 
advertising systems, where an action is attributed to an ad 
that was shown to the user several days earlier. Because of 
this, the later days still have some actions that are yet to 
be received. 

Figure 5: Comparison of eCPA Performance for the 
Clustering Algorithms over 10 Days. Lower eCPA 
that has been achieved by the proposed methodology 
indicates better performance. 

Figure 6: Comparison of eCPC Performance for the 
Clustering Algorithms over 10 Days. Lower eCPC 
that has been achieved by the proposed methodology 
indicates better performance. 


6. CONCLUSIONS AND FUTURE WORK 

In this paper, we described a methodology to cluster users 
in online advertising. We have explained the difficulties, 
such as the non-stable set of users and the large amount of 
data that we have to deal with, and how those factors shape 
the algorithm we employ and the way we implement it. We have 
given a brief overview of the implementation in Apache Pig, 
on Hadoop, as well as some initial experimental results. It 
is important to mention that we are not claiming to have 
explored all possible algorithms/models for analysis in this 
work, but rather that we have developed a meaningful and 
efficient system that solves a real-world online advertising 
problem and improves performance. In summary, we believe 
that this work fills a significant void in the literature, 
since it deals directly with the large-scale problems that 
arise in the online advertising domain. Our future work 
includes the extension of our results, as well as an 
improved clustering algorithm in which user properties (such 
as age, gender, location etc.) are used alongside the item 
sets (i.e. beacon history) for better determination of 
clusters. 